A method of combinatorial cassette mutagenesis was designed to
readily determine the informational content of individual residues
in protein sequences. The technique consists of simultaneously
randomizing two or three positions by oligonucleotide cassette
mutagenesis, selecting for functional protein, and then sequencing
to determine the spectrum of allowable substitutions at each
position. Repeated application of this method to the dimer
interface of the DNA-binding domain of λ repressor reveals that the
number and type of substitutions allowed at each position are
extremely variable. At some positions only one or two residues are
functionally acceptable; at other positions a wide range of
residues and residue types are tolerated. The number of
substitutions allowed at each position roughly correlates with the
solvent accessibility of the wild-type side chain.



IT HAS BEEN MORE THAN 20 YEARS SINCE ANFINSEN AND HIS
colleagues showed that the sequence of a protein contains all of
the information necessary to specify the three-dimensional
structure (1). However, the general problem of predicting protein
structure from sequence remains unsolved. Part of the difficulty may
stem from the complexity of protein structures. Although some 200
protein structures are known, no rules have emerged that allow
structure to be related to sequence in any simple fashion (2). The
problem is further complicated by the nonuniformity of the
structural information encoded in protein sequences. Some residue
positions are important, and changes at these positions can tip the
balance between folding and unfolding (3-7). Other residues are
relatively unimportant in a structural sense and a wide range of 
substitutions or modifications can be tolerated at these positions (3, 

If only a fraction of the residues in a protein sequence contribute 
significantly to the stability of the folded structure, then it becomes 
important to be able to identify these residues. We now describe the 
results of genetic studies that allow the importance of individual 
residues in protein sequences to be rapidly determined. Specifically 
we determine the spectrum of functionally acceptable substitutions 
at residue positions near the dimer interface of the NH2-terminal
domain of phage lambda (λ) repressor (10). The NH2-terminal
domain binds to operator DNA as a dimer, with dimerization
mediated by hydrophobic packing of α-helix 5 of one monomer
against α-helix 5' of the other monomer (11) (Fig. 1, A and B).
Without helix 5 there are no contacts between the subunits (Fig.
1C). By applying combinatorial cassette mutagenesis to the helix 5
region, we find that the number and spectrum of allowable
substitutions within helix 5 are extremely variable from residue to
residue. In most cases, this variability can be rationalized in terms
of the fractional solvent accessibility of the wild-type side chain.

General strategy. For our studies, we used a plasmid-borne gene
that encodes a functional, operator-binding fragment (residues
1-102) of λ repressor (12). The binding of the 1-102 fragment to
operator DNA depends on dimerization which, in turn, depends on
the helix 5-helix 5' packing interactions (11, 13). Thus, if a 1-102
protein retains normal operator-binding properties, we can infer
that it is able to dimerize normally.

Mutagenesis of the helix 5 region was performed by a combinatorial
cassette procedure. One example of this method, in which
codons 85 and 88 are mutagenized, is illustrated in Fig. 2. On the
top strand, the mutagenized codons are synthesized with equal
mixtures of all four bases in the first two codon positions and an
equal mixture of G and C in the third position. The resulting
population of base combinations will include codons for each of the
20 naturally occurring amino acids at each of the mutagenized
residue positions. On the bottom strand, inosine is inserted at each
randomized position because it is able to pair with each of the four
conventional bases (14). The two strands are then annealed and the
mutagenic cassette is ligated into a purified plasmid backbone.
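
The coverage claim for this codon design can be checked directly. The
sketch below is not part of the original procedure; it enumerates the
codons with any base at the first two positions and G or C at the
third, and translates them with the standard genetic code. All function
and variable names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def randomized_codons(third_position="GC"):
    """All codons with any base at positions 1 and 2 and a restricted third base."""
    return ["".join(c) for c in product(BASES, BASES, third_position)]


codons = randomized_codons()
encoded = {CODON_TABLE[c] for c in codons}
amino_acids = encoded - {"*"}          # "*" marks a stop codon
stop_codons = [c for c in codons if CODON_TABLE[c] == "*"]

print(len(codons), "codons")
print(len(amino_acids), "amino acids:", "".join(sorted(amino_acids)))
print("stop codons:", stop_codons)
```

Restricting the third position to G or C halves the codon pool from 64
to 32 while still encoding all 20 amino acids, and it leaves TAG as the
only stop codon in the pool.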

To identify plasmids encoding functional protein, we selected
transformants for plasmid-encoded resistance to ampicillin and for
resistance to killing by cI− derivatives of phage λ. The latter selection
requires that the cell express 1-102 protein that is active in operator
binding (15). For each mutagenesis experiment, many independent
transformants were chosen, single-stranded plasmid DNA was
purified, and the relevant region of the 1-102 gene was sequenced.
The resulting set of sequences provides a list of functionally
acceptable helix 5 residues.

Substitutions in the helix 5 region. In separate experiments with
different mutagenic cassettes, the codons for helix 5 residues 85 and
88; 86 and 89; 90 and 91; 84, 87, and 88; and 84, 87, and 91 were
mutagenized, and genes encoding active 1-102 proteins were
selected. In some cases, the survival frequency was low. For example,
only 17 of 60,000 transformants passed the selection after
randomization of codons 84, 87, and 88. In this case, each active
candidate was sequenced. By contrast, 1,200 of 50,000 transformants
passed the selection in the mutagenesis of positions 86 and 89 (16).
In this case, we picked 50 candidates for sequence analysis. Overall,
150 active genes were sequenced (Table 1). In addition, we sequenced
approximately 40 genes that had been mutagenized, but not subjected
to a functional selection. These serve as controls for the efficiency
of mutagenesis and also provide examples of helix 5 mutations that
result in inactive 1-102 proteins (Table 1).
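
The two survival frequencies quoted above differ by nearly two orders
of magnitude, which is easier to see when they are written as
fractions. The following is a small illustrative calculation that uses
only the counts given in the text; the names are arbitrary.

```python
from fractions import Fraction

# (transformants passing the selection, transformants tested), from the text
experiments = {
    "codons 84, 87, 88": (17, 60_000),
    "codons 86, 89": (1_200, 50_000),
}


def survival_frequency(passed, tested):
    """Fraction of transformants that survive the functional selection."""
    return Fraction(passed, tested)


frequencies = {
    name: survival_frequency(*counts) for name, counts in experiments.items()
}
for name, freq in frequencies.items():
    print(f"{name}: {float(freq):.5f} ({float(freq) * 100:.3f}%)")

ratio = frequencies["codons 86, 89"] / frequencies["codons 84, 87, 88"]
print(f"ratio of survival frequencies: about {float(ratio):.0f}-fold")
```

The low survival after randomizing codons 84, 87, and 88 is what one
would expect if few residue combinations are tolerated at those
positions.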

Many of the active sequences contain at least two residue changes 
compared to wild type. In principle, some of these changes could be 
compensatory; for example, residue X might be functionally allowed
at position 85 only in combination with residue Z at position 88. 
This cannot be generally true, however, because most residue 
changes at one position were recovered in combination with several 
different changes at the other position or positions. It is therefore 
likely that most substitutions that are functionally acceptable in 
multiply mutant backgrounds would also be allowed as single 
substitutions. In Fig. 3, we show the spectrum of functionally 
acceptable substitutions at residue positions 84 to 91. 

Table 1. Sequences for the helix 5 region of active and inactive mutants
obtained by combinatorial cassette mutagenesis. Active mutants are resistant
to phage λKH54; these are grouped by cassette, with the wild-type sequence
at the top of each group and randomized positions in boldface. Asterisks
indicate sequences of mutants obtained in the absence of a functional
selection. The activity of these mutants was subsequently determined by a
screen. Numbers next to sequences indicate the number of times particular
mutant sequences were obtained. Numbers at the tops of the columns
indicate amino acid positions. The one-letter abbreviations for the amino
acids are: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K,
Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V,
Val; W, Trp; and Y, Tyr.

From the list of allowed substitutions, several conclusions may be
drawn concerning the structural requirements at various positions in
helix 5. We consider these residue positions in order of
decreasing "informational content," where this term is roughly
defined as a value that decreases as the number of allowed
substitutions increases. Thus, the informational content of a residue
position is highest if only the wild-type amino acid is allowed and is
lowest if each of the 20 naturally occurring amino acids is allowed.
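
The text defines informational content only qualitatively. One possible
way to make it quantitative, offered here as an assumption rather than
as the authors' definition, is to take log2(20/n) for a position at
which n of the 20 amino acids are acceptable. This value is largest when
only one residue is allowed and falls to zero when all 20 are allowed,
as the qualitative definition requires.

```python
import math

N_AMINO_ACIDS = 20


def informational_content_bits(n_allowed):
    """Hypothetical measure (not from the paper): log2(20 / n_allowed), in bits."""
    if not 1 <= n_allowed <= N_AMINO_ACIDS:
        raise ValueError("n_allowed must be between 1 and 20")
    return math.log2(N_AMINO_ACIDS / n_allowed)


for n in (1, 2, 10, 20):
    print(f"{n:2d} allowed residues: {informational_content_bits(n):.2f} bits")
```

Any other strictly decreasing function of n would satisfy the
definition equally well; the logarithm is chosen only because it treats
each halving of the allowed set as one additional bit.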

Positions 84 and 87 in particular stand out as having a high
informational content. Ile appears to be the only acceptable residue
at position 84. Both Met and Leu are residues of similar size and
hydrophobicity, and are the only two residues that appear to be
functional at position 87. The side chains of Ile84 and Met87 form a
major part of the helix-helix packing interaction at the dimer
interface, where Ile84 of one subunit packs against Met87 of the
other subunit, and vice versa (Fig. 4). This cluster of four residues
also contacts the globular portions of the domain. Solvent
accessibility calculations by the method of Lee and Richards (17) show
that the Ile84 and Met87 side chains are almost completely buried (92 to
98 percent solvent inaccessible) in the structure of the dimer. We
assume that replacement of Ile84 or Met87 with smaller side chains
would diminish dimerization because hydrophobic and van der
Waals interactions would be lost. In fact, mutant repressors
containing Ser84 or Thr87 are defective in dimerization (13, 18).
Replacing Ile84 or Met87 with larger residues would also be expected
to be detrimental because substantial structural rearrangements would
be required to accommodate larger side chains.

Seven residues (Leu, Ile, Val, Thr, Cys, Ser, and Ala) are
functionally acceptable at position 91. Aromatic residues, charged
residues, and strongly hydrophilic residues are not found. The
wild-type Val side chain is partially buried in the dimer structure,
with the Cγ2 methyl group packing against the Cδ1 methyl group of the
Ile84 side chain. Although some of the acceptable substitutions such
as Ile and Thr could make equivalent packing contacts, others such
as Ala and Ser could not.

Nine residues (Trp, His, Met, Gln, Leu, Val, Ser, Gly, and Ala)
are acceptable at position 90. There is a surprisingly large range in
both the acceptable size and hydrophilicity of these side chains. This
is especially surprising because the Cβ methyl group of the wild-type
Ala is almost completely buried in the structure of the dimer and, at
first glance, it would appear that larger side chains could not be
accommodated. However, the inaccessibility of the Cβ methyl
group of Ala90 is largely caused by the Lys67 side chain, which packs
against it. By rotating the Lys67' side chain away, we were able to
introduce a Trp90 side chain by model-building without steric
clashes. Rotation of the Lys67 side chain away from Ala90 should
not be energetically costly and, in fact, is observed in crystals of the
NH2-terminal domain bound to operator DNA (19).

Nine different residues (Trp, Tyr, Phe, Met, Ile, Val, Cys, Ser, and
Ala) are functionally acceptable at position 88. There are large
variations in the sizes and volumes of the acceptable side chains,
although most are relatively hydrophobic. Charged residues and
other strongly hydrophilic residues are not observed. In the
wild-type dimer (11), the aromatic ring of Tyr88 stacks against the
ring of Tyr88'. The side chains of Trp, Phe, Met, Ile, and Val could
probably form some type of packing interaction at this position,
although those of Ala and Ser could not. It is known that the presence
of Cys at position 88 allows a stable Cys88-Cys88' disulfide bond, which
links the monomers in a conformation that is active in operator
binding (20).

Fig. 1. Three views of the DNA-binding domain of λ repressor, showing the
role of helix 5 in dimerization. (A) Proposed complex of repressor dimer
with operator DNA (11). Helix 5 of each monomer is colored more lightly
than the globular portion of that monomer. (B) Free repressor dimer,
rotated 90° from the view in (A), to show the "back side" of the molecule.
(C) Dimer with helix 5 of each monomer removed. This view illustrates the
role helix 5 plays in mediating dimerization (26).

Fig. 2. Schematic diagram showing the combinatorial cassette mutagenesis
procedure. At positions indicated as N, an equal mixture of A, G, C, and T
was used during oligonucleotide synthesis. At positions indicated as I,
inosine was used. After synthesis, the oligonucleotides were phosphorylated,
annealed, and ligated into the Xho I-Sph I backbone of plasmid pJO103.
Plasmid pJO103 is an M13 origin plasmid with the 1-102 gene under
control of a Mr promoter; the region of the 1-102 gene encoding residues
82-93 (the small Xho I-Sph I fragment) is replaced by an unrelated 1.9-kb
Xho I-Sph I "stuffer" fragment. Ligated DNA was transformed into
Escherichia coli strain X90 FW cells (27), and ampicillin-resistant colonies
were selected in the presence or absence of phage λKH54. Candidates that

Positions 85, 86, and 89 show considerable variability. At each of
these positions, 13 different amino acids were found to function. At
positions 85 and 86, aromatic, hydrophobic, polar, and charged
residues are all acceptable. At position 89, aromatic residues were
not represented, but each of the remaining classes was observed. In
the wild-type dimer, the side chains of Tyr85, Glu86, and Glu89 are
relatively solvent accessible.
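
The residue lists given above can be collected into a small table to
compare positions. The sketch below records the acceptable residues
stated in the text for positions 84, 87, 88, 90, and 91, uses the
stated count of 13 for positions 85, 86, and 89 (whose residue lists
are given only in Fig. 3), and expresses each count as a percentage of
the 20 amino acids. The dictionary layout and names are illustrative.

```python
# Acceptable residues as listed in the text (three-letter codes).
ALLOWED = {
    84: {"Ile"},
    87: {"Met", "Leu"},
    88: {"Trp", "Tyr", "Phe", "Met", "Ile", "Val", "Cys", "Ser", "Ala"},
    90: {"Trp", "His", "Met", "Gln", "Leu", "Val", "Ser", "Gly", "Ala"},
    91: {"Leu", "Ile", "Val", "Thr", "Cys", "Ser", "Ala"},
}
# For these positions the text gives only the number of acceptable residues.
COUNT_ONLY = {85: 13, 86: 13, 89: 13}

counts = {pos: len(residues) for pos, residues in ALLOWED.items()}
counts.update(COUNT_ONLY)

percent_allowed = {pos: 100 * n / 20 for pos, n in counts.items()}

# Most restricted positions first; ties are broken by position number.
for pos in sorted(percent_allowed, key=lambda p: (percent_allowed[p], p)):
    print(f"position {pos}: {counts[pos]:2d} residues ({percent_allowed[pos]:.0f}%)")
```

The percentage computed here is the quantity plotted as hatched bars
in Fig. 5.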

Several amino acids are significantly underrepresented among the
active sequences. For example, Pro is never found. This cannot be an
artifact of our mutagenesis procedure because Pro is frequently
observed among the unselected mutant sequences (Table 1). We
conclude that Pro is not found among the functional sequences
because it is selected against; its presence would presumably disrupt
the α-helical structure, and thereby the helix-helix packing at the
dimer interface.

His, Asn, and Lys are also underrepresented among the functional
helix 5 sequences. These residues are presumably not acceptable at
positions 84 and 87, where the informational content is extremely
high, and may not be acceptable at positions 88 and 91, where the
functional substitutions are generally hydrophobic in character. The
acceptability of these residues at positions such as 85 and 86 is
difficult to assess from our experiments because the codons for these
residues are present at reasonably low frequencies even among the
unselected sequences. In these cases, we probably have not
sequenced a large enough number of candidates to be confident that
all acceptable substitutions have been identified. In fact, data from
reversion studies (21) and suppressed amber studies (22) show that
certain His and Lys substitutions are acceptable in the context of the
intact λ repressor molecule.

Fig. 3. Functionally acceptable residues in the helix 5 region. The amino
acids are listed from top to bottom in order of increasing hydrophobicity
according to the scale of Eisenberg et al. (30).

Informational content and protein structure. We have combined
an efficient combinatorial mutagenesis procedure and a
functional selection to probe the informational content of the eight
residues that form the major part of the dimerization interface of the
NH2-terminal, operator-binding domain of λ repressor. At two of
these eight residue positions, the functionally acceptable choices are
highly restricted. For example, we analyzed 17 functional genes in
which codon 84 had been randomized and recovered the wild-type
residue, Ile, in every case. This is clearly a position of high

Fig. 4. Helix 5 residues high in informational content. The two isolated
helix 5 regions of the protein are shown in green and blue. Ile84 and
Met87 from the green helix are shown in yellow; Ile84' and Met87' from
the blue helix are shown in red.




Fig. 5. Correlation between the solvent accessibility and the number of
functionally acceptable substitutions. Hatched bars indicate the percentage
of the 20 naturally occurring amino acids that are functionally acceptable at a
residue position. Black bars indicate the fractional solvent accessibility of the
wild-type side chain in the dimer. Solvent accessibilities for the
NH2-terminal domain dimer (11) were computed using a 1.4 Å probe by the
method of Lee and Richards (17). Fractional accessibilities were obtained by
dividing by the appropriate side chain accessibilities calculated for the
monomer. The fractional accessibilities change only slightly if the side chain
accessibilities in the reference tripeptide Ala-X-Ala (17) are used instead as
the reference state.

informational content. The informational content is also high at
position 87, where Met and Leu are the only acceptable residues. By
contrast, the remaining positions have moderate to low
informational contents. For example, among 38 functional genes in which
codon 85 had been randomized, the wild-type residue was recovered
only once, and 12 other residues, differing in size and chemical
properties, were recovered in the remaining cases. This is clearly a
position of low informational content. It is striking that most of the
structural determinants of dimerization in this eight-residue
segment reside in two residues only. The remaining positions are
surprisingly tolerant of a wide range of substitutions. If this high
level of tolerance is generally true of protein sequences, then the
problem of understanding and predicting structure may rest largely
on the ability to identify those few residues that are crucial.

The positional variability of the informational content in helix 5
can, in general, be rationalized in terms of the solvent accessibility of
the wild-type residues in the crystal structure (11). There is a rough
correlation between the number of acceptable substitutions and the
fractional extent to which the wild-type side chain is solvent
accessible (Fig. 5). At exposed surface positions such as 85, 86, and
89, we find that many different residues and residue types can be
functionally accommodated. By contrast, at positions such as 84 and
87, where the wild-type side chain is almost completely buried, we
find that the acceptable residue choices are extremely
restricted. There is one apparent exception to the simple rule that
buried residues are high in informational content. Ala90 is
inaccessible to solvent in the crystal structure, and yet we find that many
substitutions are allowed at this position. However, the
inaccessibility of the Ala90 side chain to solvent is not due to close
packing at the dimer interface, but rather to an interaction with a nearby
surface side chain. This side chain can presumably move to allow
larger side chains to be accommodated at position 90. Examples of
this type demonstrate the need to distinguish between two types of
buried side chains: those that can become exposed by relatively
minor rearrangement of other side chains, and those that are tightly
packed in the hydrophobic core.
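
The fractional accessibility used in this comparison is a simple ratio,
as described in the legend to Fig. 5: the side chain accessibility in
the dimer divided by the accessibility of the same side chain in a
reference state (the monomer, or an Ala-X-Ala tripeptide). A minimal
sketch follows. The function names and the 0.1 cutoff are illustrative
assumptions, the areas in the example are made-up numbers rather than
values from the crystal structure, and the surface areas themselves
would have to come from a separate Lee and Richards calculation.

```python
def fractional_accessibility(area_in_dimer, area_in_reference):
    """Side chain accessible area in the dimer divided by that in the reference state."""
    if area_in_reference <= 0:
        raise ValueError("reference accessibility must be positive")
    return area_in_dimer / area_in_reference


def is_buried(fraction, cutoff=0.1):
    """Hypothetical classification: call a side chain buried below the cutoff."""
    return fraction < cutoff


# Made-up areas in square angstroms, for illustration only.
example = fractional_accessibility(area_in_dimer=6.0, area_in_reference=120.0)
print(f"fractional accessibility = {example:.2f}, buried = {is_buried(example)}")
```

A fraction of 0.05 corresponds to a side chain that is 95 percent
solvent inaccessible, within the 92 to 98 percent range reported for
positions 84 and 87. As the Ala90 example shows, a ratio of this kind
cannot by itself distinguish tightly packed side chains from those that
are merely covered by a mobile neighbor.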

There is no reason to assume that there should always be a strict 
correlation between the solvent accessibility of a residue and the 
structural informational content of that position. For one thing, the 
chemical properties of the 20 amino acids are not related in any 
simple linear fashion. Moreover, the structural importance of some
residues in proteins almost certainly stems from interactions other 
than simple hydrophobic packing. Nevertheless, the closely packed 
nature of protein interiors (23) provides a simple molecular explana- 
tion for the structural importance of buried residues, and destabiliz- 
ing mutations are commonly found to affect hydrophobic core 
residues (3-7). By contrast, missense mutations or chemical
modifications that affect surface residues are often found to have little or no
influence on protein stability (3, 7, 8). Thus, it is reasonable that 
solvent accessibility should be an extremely important determinant 
of the informational content of a residue position. 

Our overall strategy for rapidly probing informational content
should be broadly applicable to a wide range of protein
structure-function problems in systems where genetic selections or
screens can be devised. The method consists of three basic elements:
(i) the use of cassette mutagenesis to introduce extremely high levels
of targeted random mutagenesis; (ii) the use of a functional selection
to identify genes encoding active proteins; and (iii) the use of rapid
DNA sequencing methods to determine the spectrum of functionally
acceptable residues in a relatively large number of candidates. Our
method of combinatorial cassette mutagenesis (Fig. 2) allows several
residue positions to be mutagenized at the same time and, in
principle, generates a mutant population in which each of the 20
amino acids is represented at each mutagenized position (24). When
two or three codons are mutagenized at the same time, the entire
analysis is able to proceed more rapidly. Moreover, at this level of
mutagenesis most two-residue and three-residue combinations
should be present in the mutagenized population and should be
recovered if they result in a functional protein. In our study of the
packing of the 84 and 87 side chains, we recovered only two (Ile84
with Met87 and Ile84 with Leu87) of the 400 possible residue
combinations. Thus, because both positions were mutagenized in
the same experiment, we are able to conclude that there are not
significantly different ways of packing the dimer interface.
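
The figure of 400 follows from 20 choices at each of two positions. The
short calculation below, an illustration rather than part of the
original analysis, extends this to three positions and compares the
number of residue combinations with the number of DNA sequences
produced when each randomized codon has four bases at the first two
positions and G or C at the third, as described for the cassette
design.

```python
N_AMINO_ACIDS = 20
CODONS_PER_POSITION = 4 * 4 * 2  # any base, any base, then G or C


def library_sizes(n_positions):
    """Return (residue combinations, DNA sequences) for n randomized codons."""
    return N_AMINO_ACIDS ** n_positions, CODONS_PER_POSITION ** n_positions


for n in (2, 3):
    proteins, dna = library_sizes(n)
    print(f"{n} positions: {proteins} residue combinations, {dna} DNA sequences")

# Two of the 400 combinations at positions 84 and 87 were recovered.
recovered_fraction = 2 / library_sizes(2)[0]
print(f"recovered fraction at 84/87: {recovered_fraction:.3%}")
```

For comparison, the experiments described above tested 50,000 to
60,000 transformants, somewhat more than the 32,768 DNA sequences of a
three-codon library.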

In principle, data like that shown in Fig. 3 could be generated for
an entire protein sequence, and additional experiments could be
devised to determine whether the positions of high informational
content were important for structure or function. For proteins of
unknown structure, such data might be quite useful for structural
predictions. First, current predictive algorithms could be applied to
the family of related sequences generated by our method, as each of
these sequences is able to form the same basic structure. Second,
because of their fundamental repeats, α-helical and β-strand regions
might be recognized by characteristic patterns of high and low
informational content. Third, the positions of highest structural
informational content should include the residues involved in
formation of the hydrophobic core of the protein. This information
might prove useful in combination with tertiary template ideas
recently proposed (25).